# loading data
kang_seed <- read.table(here("data/sev208_kratseedbank_20120213.txt"), sep=',', header = TRUE) |>
mutate(mnd = as.factor(mnd)) |>
filter(!(species %in% c("soil", "plant", "dist", "litter", "gravel")))
gg_miss_var(kang_seed) +
labs(caption = "Figure 1: Shows that there is no missing data in the dataset.",
title = "Missing Data Plot") +
theme_bw() +
theme(plot.caption = element_text(hjust = 0))
skim(kang_seed)
| Name | kang_seed |
| Number of rows | 960 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| factor | 1 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| dir | 0 | 1 | 1 | 1 | 0 | 4 | 0 |
| loc | 0 | 1 | 1 | 1 | 0 | 4 | 0 |
| species | 0 | 1 | 4 | 6 | 0 | 8 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| mnd | 0 | 1 | FALSE | 10 | 18: 128, 23: 128, 25: 128, 6: 96 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| seeds | 0 | 1 | 11.1 | 48.21 | 0 | 0 | 0 | 3 | 656 | ▇▁▁▁▁ |
# visualize counts of each number of seeds
ggplot(kang_seed, aes(x = seeds)) +
geom_histogram(bins = 17) +
theme_bw() +
labs(title = "Count for Each Number of Seeds", caption = "Figure 2: Shows the spread of the data, a large portion of the dataset is zeros.
This will impact whichever test we decide to run.") +
theme(plot.caption = element_text(hjust = 0))
# set up data for bar graph
kang_seed_bar <- kang_seed |>
group_by(mnd) |> # group by mound
summarise(across(seeds, sum)) |> # sum seed counts for each mound
ungroup() # ungroup
# bar graph of total seed counts
ggplot() +
geom_bar(data = kang_seed_bar, aes(x = mnd, y = seeds, fill = mnd),
stat = "identity", show.legend = F) +
theme_bw() +
labs(x = "Mound Number", y = "Number of Seeds",
title = "Total Seed Count per Mound", caption = "Figure 3: Shows the total seed count for each mound location. There appear to
be differences in at least two mounds' total seed counts.") +
theme(plot.caption = element_text(hjust = 0))
Question: How does total seed number differ between kangaroo rat mound locations?
We will start by attempting to run an ANOVA test to determine if there is a difference in the total seed count between kangaroo rat mound locations. We chose this test to start because in the methods of the context paper they ran this test to compare the total seed counts.
\(H_0:\) There is no significant
difference in seed counts between the kangaroo rat mound
locations.
\(H_a:\) There is a significant
difference in seed counts between at least two of the kangaroo mound
locations.
# make aov object
kang_aov <- aov(seeds ~ mnd, data = kang_seed)
# get resiudals
kang_res <- kang_aov$residuals
# check normality
qqPlot(kang_res, xlab = "Norm Quantiles", ylab = "Seeds Residuals", main = "QQPlot to Check Seeds Normality")
## [1] 899 195
The data does not appear to be normal, which we expected because in the context reading they discussed how in their analysis the data was non-normal even after trying transformations. We decided to try a log transformation regardless because this would be the logical next step in this analysis.
# transform data
kang_seed_log <- kang_seed |>
mutate(seeds = log(seeds + 1))
# make new aov object
kang_log_aov <- aov(seeds ~ mnd, data = kang_seed_log)
# residuals
kang_log_res <- kang_log_aov$residuals
#test normality
qqPlot(kang_log_res, xlab = "Norm Quantiles", ylab = "Log(Seeds) Residuals", main = "QQPlot to Check Log(Seeds) Normality")
## [1] 899 195
The data is still non-normal as expected (we stated above that the context paper mentioned how the data is not normal after all transformations).
Due to this we are going to run a Kruskal-Wallis test, for which we meet all assumptions (there are categorical predictor variables, there are independent samples, and each group has at least five observations) and works with count data. This will change our null and alternative hypothesis to be:
\(H_0:\) There is no difference in
the median total seed number between the kangaroo rat mound
locations.
\(H_a:\) There is a difference in at
least two kangaroo rat mound locations median total seed number.
# show aov summary
kruskal.test(seeds ~ mnd, data = kang_seed)
##
## Kruskal-Wallis rank sum test
##
## data: seeds by mnd
## Kruskal-Wallis chi-squared = 16.997, df = 9, p-value = 0.04876
After running the Kruskal-Wallis test, we reject the null hypothesis. (See results).
We used a Kruskal-Wallis test to answer the question, How does the total seed number differ between kangaroo rat mound locations? The test revealed a p-value of \(0.04876\), on \(9\) degrees of freedom which shows a significant difference in the median total seed number among at least two kangaroo rat mound locations. This difference in median values provides valuable insights into the total seed variation. The general null hypothesis of the Kruskal-Wallis assumes that there is no difference among the population medians of the groups being compared. Essentially, it tests whether the samples from different groups are likely to come from the same population or if there are systematic differences between the groups. We found that there are systematic differences between the mounds, which implies that the seeds observed in one mound are not randomly distributed among all the mounds but are tied specifically to the individual mound where they were counted.
This provides significant evidence that figure 3 (provided below) is an accurate way of visualizing the difference in total seed count between kangaroo rat mounds.
# bar graph of total seed counts
ggplot() +
geom_bar(data = kang_seed_bar, aes(x = mnd, y = seeds, fill = mnd),
stat = "identity", show.legend = F) +
theme_bw() +
labs(x = "Mound Number", y = "Number of Seeds",
title = "Total Seed Count per Mound", caption = "Figure 3: Shows the total seed count for each mound location. There appear to
be differences in at least two mounds' total seed counts.") +
theme(plot.caption = element_text(hjust = 0))
There is an apparent difference in total seed count between mounds shown in this bar plot. Mound 19 has less than 500 seeds while mound 27 has over 1500 total seeds. Mounds 6, 11, 25 have about 1200 seeds and 9 and 23 have around 750 seeds. There seem to similarities for subsets of mounds but as a general consensus there is a significant difference between in total seed number between kangaroo rat mounds.
The precise data follows:
| Mound Number | Number of Seeds |
|---|---|
| 6 | 1164 |
| 9 | 739 |
| 11 | 1181 |
| 18 | 1407 |
| 19 | 394 |
| 23 | 752 |
| 24 | 873 |
| 25 | 1190 |
| 27 | 1574 |
| 29 | 1384 |
Northern peatlands, an important long-term sink of carbon, have recently been affected by drier conditions caused by climate change, leading to a phenomenon called “shrubification” (Moore). In arctic conditions, this shrubification affects “climates feedbacks including land surface albedo and enhanced evapotranspiriation” (Crump). This study makes use of Niwot Ridge data in Colorado to investigate the role of shrubification on plant reproductive success, determined through seed count, as well as the role other factors like species type or the total number of inflorescences (Seaver). This is an important subject to study because it can help see highlight the reproductive success of these shrubs, and it turn see if it will be of concern for albedo and evapotranspiration, which ultimately affects the health of the broader peatland ecosystem. We hypothesize that site type will not be a significant predictor of the measured number of seeds, but species type and total number of inflorescenes will be.
When loading the data, we only selected for treatment, species, total_inf, num_seeds, and shrub_num, since these are the only variables that we are interested in. As mentioned in the introduction, this data was provided by Micaela M Seaver and the Niwot Ridge LTER.
# loading data
shrub_raw <- read.csv(here("data/shrubstudy_seed.csv"))
shrub_seed <- shrub_raw |>
dplyr::select(treatment, species, total_inf = total_nr_infl,
num_seeds = nr_seeds, shrub_num) |>
mutate(treatment = as.factor(treatment))
# visualize missing data
gg_miss_var(shrub_seed)
This plot shows us that there are a lot of missing values for the num_seeds variable. Moving forward, we will remove these missing values.
skim(shrub_seed)
| Name | shrub_seed |
| Number of rows | 287 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| factor | 1 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| species | 0 | 1 | 6 | 6 | 0 | 19 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| treatment | 0 | 1 | FALSE | 2 | con: 174, shr: 113 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| total_inf | 0 | 1.00 | 5.15 | 10.45 | 1 | 1.00 | 1 | 5.00 | 117 | ▇▁▁▁▁ |
| num_seeds | 105 | 0.63 | 14.55 | 28.62 | 0 | 1.25 | 5 | 13.75 | 285 | ▇▁▁▁▁ |
| shrub_num | 0 | 1.00 | 29.53 | 16.04 | 5 | 13.00 | 30 | 44.00 | 54 | ▆▁▇▂▅ |
# filter out missing data
shrub_seed_sub <- shrub_seed |>
drop_na(num_seeds)
ggplot(shrub_seed_sub, aes(x = num_seeds)) +
geom_histogram(bins = 17)
This histogram shows us that our data includes a lot of zeros.
ggplot(shrub_seed_sub, aes(x = species)) +
geom_bar(stat = 'count')
One thing to note from this plot is that TRIDAS appears to only have one sample. Because of this, in the future, the predictions are likely to be all over the place and have a very large confidence interval. In order to increase the interpretability of the following plots, we will remove TRIDAS.
# filter out missing data
shrub_seed_sub <- shrub_seed_sub |>
subset(species != "TRIDAS")
shrub_seed_sub |>
dplyr::select(!num_seeds) |>
ggpairs()
Question: How does seed count vary with plot type (shrub or open), plant species, and total number of inflorescences? Is there a simpler model that explains seed count, and if so, what is it?
\(H0:\) Plot type, plant species, and total number of inflorescences has no effect on seed count. \(H1:\) Plot type, plant species, and total number of inflorescences has a significant effect on seed count.
In this section, we will be making use of the glmmTMB package to create generalized linear models, and the DHARMA package as a way to check the diagnostics of these models.
# simple models
shrub_null <- lm(num_seeds ~ 1, data = shrub_seed_sub)
shrub_treatment <- lm(num_seeds ~ treatment, data = shrub_seed_sub)
shrub_species <- lm(num_seeds ~ species, data = shrub_seed_sub)
shrub_inf <- lm(num_seeds ~ total_inf, data = shrub_seed_sub)
# linear model, we know this is wrong
shrub_model1 <- lm(num_seeds ~ treatment + species + total_inf, data = shrub_seed_sub)
# generalized linear model with Poisson distribution
shrub_model2 <- glm(num_seeds ~ treatment + species + total_inf, data = shrub_seed_sub, family = "poisson")
# generalized linear model with negative binomial distribution
shrub_model3 <- glmmTMB(num_seeds ~ treatment + species + total_inf, data = shrub_seed_sub, family = "nbinom2")
# generalized linear model with Poisson distribution and random effect of treatment
shrub_model4 <- glmmTMB(num_seeds ~ treatment + species + total_inf + (1|shrub_num),
data = shrub_seed_sub, family = "poisson")
# generalized linear model with negative binomial distribution and random effect of treatment
shrub_model5 <- glmmTMB(num_seeds ~ treatment + species + total_inf + (1|shrub_num),
data = shrub_seed_sub, family = "nbinom2")
Because we are looking at count data, we know that the data is discrete and only has a lower bound. Knowing this, we built a couple different models using the Poisson and Negative Binomial distribution.
# check diagnostics
simulationOutput_m1 <- simulateResiduals(shrub_model1)
plot(simulationOutput_m1)
par(mfrow = c(1,2))
testDispersion(simulationOutput_m1)
testZeroInflation(simulationOutput_m1)
# check diagnostics
simulationOutput_m2 <- simulateResiduals(shrub_model2)
plot(simulationOutput_m2)
par(mfrow = c(1,2))
testDispersion(simulationOutput_m2)
testZeroInflation(simulationOutput_m2)
simulationOutput_m3 <- simulateResiduals(shrub_model3)
plot(simulationOutput_m3)
par(mfrow = c(1,2))
testDispersion(simulationOutput_m3)
testZeroInflation(simulationOutput_m3)
# check diagnostics
simulationOutput_m4 <- simulateResiduals(shrub_model4)
plot(simulationOutput_m4)
par(mfrow = c(1,2))
testDispersion(simulationOutput_m4)
testZeroInflation(simulationOutput_m4)
# check diagnostics
simulationOutput_m5 <- simulateResiduals(shrub_model5)
plot(simulationOutput_m5)
par(mfrow = c(1,2))
testDispersion(simulationOutput_m5)
testZeroInflation(simulationOutput_m5)
| Model | Formula | Distribution | QQ Plot Residuals | Residual vs. Predicted | Over/Under Dispersion | Zeros |
|---|---|---|---|---|---|---|
| shrub_model1 | num_seeds ~ treatment + species + total_inf | Normal | F | F | Under | Too many |
| shrub_model2 | num_seeds ~ treatment + species + total_inf | Poisson | F | F | Over | Too many |
| shrub_model3 | num_seeds ~ treatment + species + total_inf | Negative Binomial | P | F | None | Too many |
| shrub_model4 | num_seeds ~ treatment + species + total_inf + (1|shrub_num) | Poisson | F | F | Over | Too many |
| shrub_model5 | num_seeds ~ treatment + species + total_inf + (1|shrub_num) | Negative Binomial | P | F | None | Too many |
Based on the tests above, it appears that either shrub_model3 or shrub_model5 will be the best models because neither are over or under-dispered. However, they do still have some patterns in the residuals, and they appear to be zero-inflated. To further our model, we will create a new model with a zero-inflated term.
# generalized linear model with negative binomial distribution
shrub_model6 <- glmmTMB(num_seeds ~ treatment + species + total_inf, ziformula = ~1,
data = shrub_seed_sub, family = "nbinom2")
With the zero-inflated model created, we will now run the necessary diagnostics.
# check diagnostics
simulationOutput_m6 <- simulateResiduals(shrub_model6)
plot(simulationOutput_m6)
par(mfrow = c(1,2))
testDispersion(simulationOutput_m6)
testZeroInflation(simulationOutput_m6)
Based on the above plots, shrub_model6 appears to be our best model. There is no over or underdispersion, and our distribution has about the same number of zeros as our model would predict. There is still some trend in our residuals, but it appears to have been lessened when compared to shrub_model3. We acknowledge this trend in residuals and have decided that isn’t a big cause for concern since the other diagnostics were met.
Moving forward, we will check AICC values to test whether our assumptions about the models are correct.
MuMIn::model.sel(shrub_null, shrub_treatment, shrub_inf, shrub_species, shrub_model1, shrub_model2, shrub_model3, shrub_model4, shrub_model5, shrub_model6)
## Model selection table
## (Intr) trt ttl_inf spc cnd((Int)) dsp((Int)) cnd(spc)
## shrub_model6 2.156 + +
## shrub_model3 1.917 + +
## shrub_model5 1.915 + +
## shrub_model1 -2.5670 + 2.15000 +
## shrub_inf -0.5566 2.09100
## shrub_species 36.2900 +
## shrub_null 14.4300
## shrub_treatment 16.1300 +
## shrub_model4 2.488 + +
## shrub_model2 2.5490 + 0.02989 +
## cnd(ttl_inf) cnd(trt) zi((Int)) family class random df
## shrub_model6 0.06634 + -2.283 n2(lg) glmmTMB 9
## shrub_model3 0.07502 + n2(lg) glmmTMB 8
## shrub_model5 0.07505 + n2(lg) glmmTMB c(s_n) 9
## shrub_model1 gs(id) lm 8
## shrub_inf gs(id) lm 3
## shrub_species gs(id) lm 6
## shrub_null gs(id) lm 2
## shrub_treatment gs(id) lm 3
## shrub_model4 0.03145 + ps(lg) glmmTMB c(s_n) 8
## shrub_model2 ps(lg) glm 7
## logLik AICc delta weight
## shrub_model6 -548.867 1116.8 0.00 0.852
## shrub_model3 -552.052 1120.9 4.15 0.107
## shrub_model5 -551.884 1122.8 6.03 0.042
## shrub_model1 -689.162 1395.2 278.37 0.000
## shrub_inf -695.263 1396.7 279.88 0.000
## shrub_species -849.514 1711.5 594.72 0.000
## shrub_null -863.595 1731.3 614.47 0.000
## shrub_treatment -863.096 1732.3 615.54 0.000
## shrub_model4 -1055.923 2128.7 1011.90 0.000
## shrub_model2 -1147.380 2309.4 1192.62 0.000
## Abbreviations:
## family: gs(id) = 'gaussian(identity)', n2(lg) = 'nbinom2(log)',
## ps(lg) = 'poisson(log)'
## Models ranked by AICc(x)
## Random terms:
## c(s_n): cond(1 | shrub_num)
This chart confirms our assumptions about our models; shrub_model6 is best, followed by shrub_model3 and shrub_model5. We know this because of their respective AICC values of 1127.6, 1132.0, and 1133.9. In general, a lower AICC value equates to a better model.
Additionally, we can see that our simpler models aren’t performing as well compared to our more complex models.
Next, we will look closer at our model: shrub_model6.
# summary
summary(shrub_model6)
## Family: nbinom2 ( log )
## Formula: num_seeds ~ treatment + species + total_inf
## Zero inflation: ~1
## Data: shrub_seed_sub
##
## AIC BIC logLik deviance df.resid
## 1115.7 1144.5 -548.9 1097.7 172
##
##
## Dispersion parameter for nbinom2 family (): 2.18
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.155988 0.212883 10.128 < 2e-16 ***
## treatmentshrub -0.299250 0.131830 -2.270 0.0232 *
## speciesCARRUP -1.718777 0.281984 -6.095 1.09e-09 ***
## speciesGEUROS -0.238970 0.242542 -0.985 0.3245
## speciesKOBMYO 0.046928 0.201021 0.233 0.8154
## speciesMINOBT -0.416711 0.212101 -1.965 0.0495 *
## total_inf 0.066343 0.007516 8.827 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Zero-inflation model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.2827 0.3991 -5.719 1.07e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The summary of our model shows that treatmentshrub, speciesCARRUP, speciesMINOBT, speciesTRIDAS, and total_inf all appears to be significant indicators of total number of seeds.
# confidence intervals
confint(shrub_model6)
## 2.5 % 97.5 % Estimate
## cond.(Intercept) 1.7387449 2.573230737 2.15598779
## cond.treatmentshrub -0.5576313 -0.040868234 -0.29924976
## cond.speciesCARRUP -2.2714556 -1.166099126 -1.71877737
## cond.speciesGEUROS -0.7143430 0.236403207 -0.23896992
## cond.speciesKOBMYO -0.3470654 0.440921896 0.04692825
## cond.speciesMINOBT -0.8324210 -0.001001714 -0.41671134
## cond.total_inf 0.0516118 0.081073969 0.06634289
## zi.(Intercept) -3.0649895 -1.500469125 -2.28272929
# adjusted R2
r.squaredGLMM(shrub_model6)
## R2m R2c
## delta 0.7225478 0.7225478
## lognormal 0.7643760 0.7643760
## trigamma 0.6655456 0.6655456
# model object in table
shrub_model6 |>
as_flextable() |>
set_caption(
as_paragraph(
as_chunk("Table 1: This table displays what we studied earlier in the summary,
just in an easier to digest format.
",
props = fp_text_default(font.family = "Arial"))
),
word_stylename = "Table Caption"
)
Estimate | Standard Error | statistic | p-value | ||||
|---|---|---|---|---|---|---|---|
Fixed effects | |||||||
(Intercept) | 2.156 | 0.213 | 10.128 | 0.0000 | *** | ||
treatmentshrub | -0.299 | 0.132 | -2.270 | 0.0232 | * | ||
speciesCARRUP | -1.719 | 0.282 | -6.095 | 0.0000 | *** | ||
speciesGEUROS | -0.239 | 0.243 | -0.985 | 0.3245 |
| ||
speciesKOBMYO | 0.047 | 0.201 | 0.233 | 0.8154 |
| ||
speciesMINOBT | -0.417 | 0.212 | -1.965 | 0.0495 | * | ||
total_inf | 0.066 | 0.008 | 8.827 | 0.0000 | *** | ||
(Intercept) | -2.283 | 0.399 | -5.719 | 0.0000 | *** | ||
Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05 | |||||||
square root of the estimated residual variance: 2.2 | |||||||
data's log-likelihood under the model: -548.9 | |||||||
Akaike Information Criterion: 1,115.7 | |||||||
Bayesian Information Criterion: 1,144.5 | |||||||
# Compute predictions
predictions <- ggpredict(shrub_model6, terms = c("treatment", "species", "total_inf"))
# Plot showing seed differences by Species in the ACTUAL DATA
ggplot(shrub_seed_sub, aes(x = treatment, y = num_seeds, fill = treatment)) +
geom_boxplot(aes(color = treatment), alpha = 0.5) +
facet_wrap(~ species) +
scale_fill_manual(values = c("control" = "lightblue", "shrub" = "salmon")) +
scale_color_manual(values = c("control" = "lightblue", "shrub" = "salmon")) +
theme_bw() +
labs(
x = "",
y = "Number of Seeds",
title = "Measured Number of Seeds per Species",
caption = "Figure X: This figure compares the measured number of seeds per species. The species AREFEN appears
to have the highest number of seeds, whereas CARRUP appears to have the lowest number of
seeds. The AREFEN species has largest spread, likely because it was the species with the
least amount of samples (apart from TRIDAS, which we removed)."
) +
coord_cartesian(ylim = c(0, 70)) +
theme(plot.caption = element_text(hjust=0),
legend.position = 'none')
# Predictions
pred <- ggpredict(shrub_model6, terms = "species", back.transform = TRUE)
# Plot showing seed differences by Species
plot(pred) +
labs(
x = "",
y = "Number of Seeds",
title = "Comparing Estimated Number of Seeds per Species",
caption = "Figure 1: This figure compares the estimated number of seeds per species. The species AREFEN and
KOBMYO appears to have the highest number of seeds, whereas CARRUP appears to have
the lowest number of seeds. The AREFEN species has largest spread, likely because it
was the species with the least amount of samples. These estimated results are very
similar to what we saw in the measured data."
) +
theme(plot.caption = element_text(hjust=0))
# Viz showing measured seed difference by plot type
plottype1 <- ggplot(shrub_seed_sub, aes(x = treatment, y = num_seeds, fill = treatment)) +
geom_boxplot(aes(color = treatment), alpha = 0.5) +
scale_fill_manual(values = c("control" = "lightblue", "shrub" = "salmon")) +
scale_color_manual(values = c("control" = "lightblue", "shrub" = "salmon")) +
theme_bw() +
labs(
x = "Species",
y = "Number of Seeds",
title = "Comparing Measured Number of Seeds per Species"
) +
coord_cartesian(ylim = c(0, 40)) +
theme(legend.position = "none")
# Predictions
pred2 <- ggpredict(shrub_model6, terms = "treatment", back.transform = TRUE)
# Viz showing estimated seed difference by plot type
plottype2 <- plot(pred2) +
labs(
x = "Plot Type",
y = "Number of Seeds",
title = "Comparing Estimated Number of Seeds per Plot Type"
) +
coord_cartesian(ylim = c(0, 40))
# making viz side by side
plottype_fig <- ggarrange(plottype1 + rremove("xlab") + rremove('ylab'),
plottype2 + rremove("xlab") + rremove('ylab'),
ncol = 2,
labels = c('A.', 'B.', 'C.', 'D.', 'E.'), # adding figure labels in top left of plot
vjust = 5, # adjusting figure label position
hjust = -4)
annotate_figure(plottype_fig,
left = textGrob("Number of Seeds", # add universal y-axis
rot = 90, vjust = 0.5, # styling
gp = gpar(cex = 1.3)),
bottom = textGrob("",
gp = gpar(cex = 1.3))) +
labs(
caption = "
Figure 2: This figure shows the difference in the number of seeds per plot type, for both the
measured data, as well as the data estimated from our model. The model manages
to predict which plot results in higher seed count, but it fails to capture the
variability."
) +
theme(plot.caption = element_text(hjust=0))
inf1 <- ggplot(shrub_seed_sub, aes(x = total_inf, y = num_seeds)) +
geom_point() +
theme_bw() +
labs(
x = "Number of inflorescences",
y = "Number of Seeds",
title = "Comparing Measured Number of Seeds to Number of inflorescences"
) +
coord_cartesian(xlim = c(0, 60),
ylim = c(0, 500))
# Predictions
pred3 <- ggpredict(shrub_model6, terms = "total_inf", back.transform = TRUE)
# Plot showing seed differences by Species
inf2 <- plot(pred3) +
labs(
x = "Number of inflorescences",
y = "Number of Seeds",
title = "Comparing Estimated Number of Seeds to Number of inflorescences"
) +
coord_cartesian(xlim = c(0, 60),
ylim = c(0, 500))
# making viz side by side
inf_fig <- ggarrange(inf1 + rremove("xlab") + rremove('ylab'),
inf2 + rremove("xlab") + rremove('ylab'),
ncol = 2,
labels = c('A.', 'B.', 'C.', 'D.', 'E.'), # adding figure labels in top left of plot
vjust = 5, # adjusting figure label position
hjust = -4)
annotate_figure(inf_fig,
left = textGrob("Number of Seeds", # add universal y-axis
rot = 90, vjust = 0.5, # styling
gp = gpar(cex = 1.3)),
bottom = textGrob("Number of inflorescences",
gp = gpar(cex = 1.3))) +
labs(
caption = "
Figure 3: This figure shows how the number of seeds varies as the number of inflorescences
increases. The model appears to do a good job at predicting when you compare it
to the measured data."
) +
theme(plot.caption = element_text(hjust=0))
To answer the question we are studying, how does seed count vary with plot type (shrub or open), plant species, and total number of inflorescences, we need to look at the effects of each of the variables.
The first pair of plots show that different species result in a wide variety in the seed count. In general, our model tends to underestimate the number of seeds for all of the species. However, it does correctly predict which species result in a larger number of seeds or a smaller number of seeds, which we believe to be the more valuable insight.
The second pair of plots show that the number of seeds is generally slightly higher in the open (control) plot. Referencing back to the flex table, we know that this is a significant difference in our model, despite being relatively small.
Lastly, the third pair of plots show that the number of seeds increases as the number of total inflorescences increases. Because of the lack of samples with high inflorescences, the confidence interval grows significantly on the latter part of the plot. However, where we have a lot of data points, the model predicts the number of seeds very well.
Referring back to what we saw in these visualizations and the flex table, we can see that all of the variables – species, plot type, and the number of inflorescences – ended up being significant predictors of the total number of seeds. In conclusion, our hypotheses about plot type was incorrect, but our hypotheses about species and number of inflorescences were correct.
{r. include = FALSE, message = FALSE, echo = FALSE, eval = FALSE} #cite packages c("tidyverse", "here", "janitor", "ggeffects", "performance", "naniar", "flextable", "car", "broom", "corrplot", "AICcmodavg", "MASS", "lme4", "MuMIn", "lmtest", "DHARMa", "ggeffects", "lmtest", "skimr", "GGally", "glmmTMB", "broom.mixed", "effects", "ggeffects", "ggpubr", "grid", "purrr") |> map(citation) |> print(style = "text")
Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R,
Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E,
Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K,
Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.”
Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686
https://doi.org/10.21105/joss.01686.
Müller K (2020). here: A Simpler Way to Find Your Files. R package version 1.0.1, https://CRAN.R-project.org/package=here.
Firke S (2021). janitor: Simple Tools for Examining and Cleaning Dirty Data. R package version 2.1.0, https://CRAN.R-project.org/package=janitor.
Lüdecke D (2018). “ggeffects: Tidy Data Frames of Marginal Effects from Regression Models.” Journal of Open Source Software, 3(26), 772. doi:10.21105/joss.00772 https://doi.org/10.21105/joss.00772.
Lüdecke D, Ben-Shachar M, Patil I, Waggoner P, Makowski D (2021). “performance: An R Package for Assessment, Comparison and Testing of Statistical Models.” Journal of Open Source Software, 6(60), 3139. doi:10.21105/joss.03139 https://doi.org/10.21105/joss.03139.
Tierney N, Cook D (2023). “Expanding Tidy Data Principles to Facilitate Missing Data Exploration, Visualization and Assessment of Imputations.” Journal of Statistical Software, 105(7), 1-31. doi:10.18637/jss.v105.i07 https://doi.org/10.18637/jss.v105.i07.
Gohel D, Skintzos P (2023). flextable: Functions for Tabular Reporting. R package version 0.9.1, https://CRAN.R-project.org/package=flextable.
Fox J, Weisberg S (2019). An R Companion to Applied Regression, Third edition. Sage, Thousand Oaks CA. https://socialsciences.mcmaster.ca/jfox/Books/Companion/.
Robinson D, Hayes A, Couch S (2022). broom: Convert Statistical Objects into Tidy Tibbles. R package version 1.0.1, https://CRAN.R-project.org/package=broom.
Wei T, Simko V (2021). R package ‘corrplot’: Visualization of a Correlation Matrix. (Version 0.92), https://github.com/taiyun/corrplot.
Mazerolle MJ (2023). AICcmodavg: Model selection and multimodel inference based on (Q)AIC(c). R package version 2.3.2, https://cran.r-project.org/package=AICcmodavg.
Venables WN, Ripley BD (2002). Modern Applied Statistics with S, Fourth edition. Springer, New York. ISBN 0-387-95457-0, https://www.stats.ox.ac.uk/pub/MASS4/.
Bates D, Mächler M, Bolker B, Walker S (2015). “Fitting Linear Mixed-Effects Models Using lme4.” Journal of Statistical Software, 67(1), 1-48. doi:10.18637/jss.v067.i01 https://doi.org/10.18637/jss.v067.i01.
Bartoń K (2023). MuMIn: Multi-Model Inference. R package version 1.47.5, https://CRAN.R-project.org/package=MuMIn.
Zeileis A, Hothorn T (2002). “Diagnostic Checking in Regression Relationships.” R News, 2(3), 7-10. https://CRAN.R-project.org/doc/Rnews/.
Hartig F (2022). DHARMa: Residual Diagnostics for Hierarchical (Multi-Level / Mixed) Regression Models. R package version 0.4.6, https://CRAN.R-project.org/package=DHARMa.
Lüdecke D (2018). “ggeffects: Tidy Data Frames of Marginal Effects from Regression Models.” Journal of Open Source Software, 3(26), 772. doi:10.21105/joss.00772 https://doi.org/10.21105/joss.00772.
Zeileis A, Hothorn T (2002). “Diagnostic Checking in Regression Relationships.” R News, 2(3), 7-10. https://CRAN.R-project.org/doc/Rnews/.
Waring E, Quinn M, McNamara A, Arino de la Rubia E, Zhu H, Ellis S (2022). skimr: Compact and Flexible Summaries of Data. R package version 2.1.5, https://CRAN.R-project.org/package=skimr.
Schloerke B, Cook D, Larmarange J, Briatte F, Marbach M, Thoen E, Elberg A, Crowley J (2021). GGally: Extension to ‘ggplot2’. R package version 2.1.2, https://CRAN.R-project.org/package=GGally.
Brooks ME, Kristensen K, van Benthem KJ, Magnusson A, Berg CW, Nielsen A, Skaug HJ, Maechler M, Bolker BM (2017). “glmmTMB Balances Speed and Flexibility Among Packages for Zero-inflated Generalized Linear Mixed Modeling.” The R Journal, 9(2), 378-400. doi:10.32614/RJ-2017-066 https://doi.org/10.32614/RJ-2017-066.
Bolker B, Robinson D (2022). broom.mixed: Tidying Methods for Mixed Models. R package version 0.2.9.4, https://CRAN.R-project.org/package=broom.mixed.
Fox J, Weisberg S (2019). An R Companion to Applied Regression, 3rd edition. Sage, Thousand Oaks CA. https://socialsciences.mcmaster.ca/jfox/Books/Companion/index.html.
Fox J, Weisberg S (2018). “Visualizing Fit and Lack of Fit in Complex Regression Models with Predictor Effect Plots and Partial Residuals.” Journal of Statistical Software, 87(9), 1-27. doi:10.18637/jss.v087.i09 https://doi.org/10.18637/jss.v087.i09.
Fox J (2003). “Effect Displays in R for Generalised Linear Models.” Journal of Statistical Software, 8(15), 1-27. doi:10.18637/jss.v008.i15 https://doi.org/10.18637/jss.v008.i15.
Fox J, Hong J (2009). “Effect Displays in R for Multinomial and Proportional-Odds Logit Models: Extensions to the effects Package.” Journal of Statistical Software, 32(1), 1-24. doi:10.18637/jss.v032.i01 https://doi.org/10.18637/jss.v032.i01.
Lüdecke D (2018). “ggeffects: Tidy Data Frames of Marginal Effects from Regression Models.” Journal of Open Source Software, 3(26), 772. doi:10.21105/joss.00772 https://doi.org/10.21105/joss.00772.
Kassambara A (2023). ggpubr: ‘ggplot2’ Based Publication Ready Plots. R package version 0.6.0, https://CRAN.R-project.org/package=ggpubr.
R Core Team (2022). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
Wickham H, Henry L (2023). purrr: Functional Programming Tools. R package version 1.0.1, https://CRAN.R-project.org/package=purrr.
Crump, Sarah E. et al. “Arctic Shrub Colonization Lagged Peak Postglacial Warmth: Molecular Evidence in Lake Sediment from Arctic Canada.” Global change biology 25.12 (2019): 4244–4256. Web.
Moore, Paul A. et al. “Examining the Peatland Shrubification‐evapotranspiration Feedback Following Multi‐decadal Water Table Manipulation.” Hydrological processes 36.11 (2022): n. pag. Web.
Seaver, M. 2022. Individual and community flowering phenology, seed counts and pollinator visitation rates in shrub and open plots across Niwot Ridge, 2019 - 2021. ver 1. Environmental Data Initiative. https://doi.org/10.6073/pasta/edc4ab2faf7dd96cd1deac1544d5f2b9 (Accessed 2023-06-13).